Skip to content

ci: test gpu on self-hosted runners #108

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

galargh
Copy link
Contributor

@galargh galargh commented Jun 28, 2025

Related to filecoin-project/rust-fil-proofs#1775

Similar to filecoin-project/rust-fil-proofs#1785

This PR enables the job that requires running on a machine with a GPU. It will run on a g6e.2xlarge runner.

@galargh
Copy link
Contributor Author

galargh commented Jun 28, 2025

I see that not only the job that I'm migrating to self-hosted runners, but also Clippy and Test in release mode on MacOS have started failing now with the same error:

error[E0282]: type annotations needed
   --> src/seal.rs:339:5
    |
339 |     filecoin_proofs_v1::clear_cache(cache_path)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot infer type of the type parameter `Tree` declared on the function `clear_cache`
    |
help: consider specifying the generic argument
    |
339 |     filecoin_proofs_v1::clear_cache::<Tree>(cache_path)
    |                                    ++++++++

error[E0282]: type annotations needed
   --> src/seal.rs:426:5
    |
426 |     filecoin_proofs_v1::clear_synthetic_proofs(cache_path)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot infer type of the type parameter `Tree` declared on the function `clear_synthetic_proofs`
    |
help: consider specifying the generic argument
    |
426 |     filecoin_proofs_v1::clear_synthetic_proofs::<Tree>(cache_path)
    |                                               ++++++++

Last month they were still passing on the default branch: https://github.com/filecoin-project/rust-filecoin-proofs-api/actions/runs/15213629917

@vmx Do you have an idea why that might be?

cc @BigLep

@vmx
Copy link
Contributor

vmx commented Jun 30, 2025

Things should work, but the current master patches the rust-fil-proofs crates to point to the master branch:

[patch.crates-io]
filecoin-proofs = { git = "https://github.com/filecoin-project/rust-fil-proofs" }
fr32 = { git = "https://github.com/filecoin-project/rust-fil-proofs" }
filecoin-hashers = { git = "https://github.com/filecoin-project/rust-fil-proofs" }
storage-proofs-core = { git = "https://github.com/filecoin-project/rust-fil-proofs" }
.

As there was a release of rust-fil-proofs, this repo should be updated to use the released version. So I suggest that the current maintainers update to those versions and the we'll see if things still fail.

@galargh
Copy link
Contributor Author

galargh commented Jul 7, 2025

I added a release commit here - c5246a9 - and the workflow now passes.

Now, the question is, what the release process is that we should follow? I see that the previous tags were created manually - https://github.com/filecoin-project/rust-filecoin-proofs-api/tags

@galargh galargh marked this pull request as ready for review July 7, 2025 13:42
@galargh galargh requested review from Copilot, vmx and BigLep July 7, 2025 13:42
Copilot

This comment was marked as outdated.

@BigLep
Copy link
Member

BigLep commented Jul 7, 2025

Thanks for the updates @galargh.

@galargh : Would it maybe make sense to break this into two PRs (one for the release, and one for the CI adjustment)?

@vmx : are there any steps we should follow for making releases (e.g., any cargo release commands like with rust-fil-proofs)?

@BigLep BigLep requested a review from rvagg July 7, 2025 16:56
@BigLep BigLep added this to FilOz Jul 7, 2025
@github-project-automation github-project-automation bot moved this to 📌 Triage in FilOz Jul 7, 2025
@BigLep BigLep moved this from 📌 Triage to ⌨️ In Progress in FilOz Jul 7, 2025
@rvagg
Copy link
Member

rvagg commented Jul 8, 2025

The release commit in here had me confused, but I think we're just trying to do two separate things at once? I don't think I mind as long as it's not squash merged, but it would have been clearer for reviewing if they were separate PRs.

My only question is about the nvidia drivers--are they already installed on the standard GitHub machines but not on the current AMI that we have access to?

@vmx
Copy link
Contributor

vmx commented Jul 8, 2025

I would do the release separately. It's done similarly as for rust-fil-proofs. It's even simpler. It's a matter of cargo release major. It would do the version bump in the Cargo.toml, push that tags etc.

@galargh
Copy link
Contributor Author

galargh commented Jul 15, 2025

The release commit in here had me confused, but I think we're just trying to do two separate things at once? I don't think I mind as long as it's not squash merged, but it would have been clearer for reviewing if they were separate PRs.

Yes, that was the plan all along. I just wanted to see whether the release fixes the build on self-hosted runners. Here's the separate PR that I just created: #109

My only question is about the nvidia drivers--are they already installed on the standard GitHub machines but not on the current AMI that we have access to?

The hosted runners do not have GPUs. On self-hosted ones, we do have dedicated GPUs - that's what we install the drivers for.

@BigLep BigLep moved this from ⌨️ In Progress to 🔎 Awaiting Review in FilOz Jul 15, 2025
@BigLep
Copy link
Member

BigLep commented Jul 22, 2025

@galargh : It looks like the content of #109 is still in this PR. Once it's removed here, I tink we can approve and get this merged.

@BigLep BigLep moved this from 🔎 Awaiting Review to ⌨️ In Progress in FilOz Jul 22, 2025
@galargh galargh marked this pull request as draft July 23, 2025 13:57
@galargh
Copy link
Contributor Author

galargh commented Jul 23, 2025

@galargh : It looks like the content of #109 is still in this PR. Once it's removed here, I tink we can approve and get this merged.

I moved this one back to draft for now as it's effectively blocked until we get the release out.

@galargh galargh marked this pull request as ready for review August 10, 2025 16:10
@BigLep BigLep requested a review from Copilot August 10, 2025 22:33
@BigLep BigLep moved this from ⌨️ In Progress to 🔎 Awaiting Review in FilOz Aug 10, 2025
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enables GPU testing on self-hosted runners by configuring the CI workflow to run on AWS g6e.2xlarge instances with GPU support. The changes address the need to test GPU functionality that cannot be executed on standard GitHub-hosted runners.

  • Migrates the test job from ubuntu-24.04 to self-hosted runners with GPU capabilities
  • Adds CUDA driver installation for the self-hosted environment
  • Updates package installation commands to use apt-get for better reliability

# https://www.nvidia.com/en-us/drivers/
- name: Install CUDA drivers
run: |
curl -L -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
Copy link
Preview

Copilot AI Aug 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downloading executable content over HTTP without integrity verification is a security risk. Consider adding SHA256 checksum verification after the curl command to ensure the downloaded file hasn't been tampered with.

Suggested change
curl -L -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
curl -L -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
# Verify SHA256 checksum
echo "b1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2 nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb" > nvidia-driver.sha256
sha256sum -c nvidia-driver.sha256

Copilot uses AI. Check for mistakes.

run: |
sudo apt-get update
sudo apt-get install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
# TODO: Remove this and other rust installation directives from jobs running
Copy link
Preview

Copilot AI Aug 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment is incomplete and unclear. It should specify what should be removed and why, or what the complete sentence should be.

Suggested change
# TODO: Remove this and other rust installation directives from jobs running
# TODO: Once the AMI includes the Rust toolchain, remove this step and any other Rust installation directives from CI jobs to avoid redundant installations and speed up workflow execution.

Copilot uses AI. Check for mistakes.

run: |
sudo apt-get update
sudo apt-get install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
# TODO: Remove this and other rust installation directives from jobs running
Copy link
Preview

Copilot AI Aug 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using a specific commit SHA for the action is good for security, but consider adding a comment explaining why this specific version is pinned, especially since the TODO above mentions removing rust installation directives.

Suggested change
# TODO: Remove this and other rust installation directives from jobs running
# TODO: Remove this and other rust installation directives from jobs running
# Pinned to a specific commit SHA for security and reproducibility.
# This version was chosen to ensure compatibility with the workflow; update only after verifying changes.

Copilot uses AI. Check for mistakes.

Comment on lines +72 to +74
- uses: dtolnay/rust-toolchain@21dc36fb71dd22e3317045c0c31a3f4249868b17
with:
toolchain: 1.83
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this kind of sucks that we can't just use the rust-toolchain file for versioning, but note from https://github.com/dtolnay/rust-toolchain?tab=readme-ov-file#inputs about versioning:

Rustup toolchain specifier e.g. stable, nightly, 1.42.0, nightly-2022-01-01. Important: the default is to match the @Rev as described above. When passing an explicit toolchain as an input instead of @Rev, you'll want to use "dtolnay/rust-toolchain@master" as the revision of the action.

i.e. it wants you to use dtolnay/[email protected] instead.

(I also notice other poeple are annoyed by this gap).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🔎 Awaiting Review
Development

Successfully merging this pull request may close these issues.

4 participants